GffRead to filter Trinity-derived
GMAP-aligned gff3 files.featureCountshtseq-count
to perform feature counting with respect to types “locus”,
“mRNA”, “exon”, and “CDS” features from “base” calls to
GffRead, and with respect to type “gene” from
“intron-filtering-only” calls to GffRead.gtf files using
custom R code; perform processing as necessary using a mix
of bioconductor and tidyverse programs.rtracklayergtf files.gtf files to Alison Greenlaw for
inspection.Trinity parameter mkc = 1, 4, 8. However,
for the sake of completeness, and to understand clearly what
“incorrectness” looks like, I will filter mkc = 2, 16, 32
as well. Remember that the Trinity parameters
mir, mg, and gf were held at
default levels for the specific assemblies to surveyed by Alison.
gffread#!/bin/bash
# "Base" call to GffRead
gffread \
-v \
-g "${fasta_g}" \
-i 1000 \
-Z \
-M -K -Q \
-F -N -P \
--force-exons --gene2exon \
-o "${out}" \
<(awk -F '\t' 'BEGIN {OFS = FS} { gsub(/chr/, "", $1); gsub(/M/, "Mito", $1); print }' "${in}") \
> >(tee -a "${err_out%.}.stdout.txt") \
2> >(tee -a "${err_out%.}.stderr.txt")
# "Intron-filtering-only" call to GffRead
#+
#+ No collapsing, merging with these files; only filtering to exclude exonic
#+ features with introns greater than 1000 bp in length
gffread \
-v -O \
-i 1000 \
-o "${out/.gff3/-intron-filtering-only.gff3}" \
<(awk -F '\t' 'BEGIN {OFS = FS} { gsub(/chr/, "", $1); gsub(/M/, "Mito", $1); print }' "${in}") \
> >(tee -a "${err_out%.}-intron-filtering-only.stdout.txt") \
2> >(tee -a "${err_out%.}-intron-filtering-only.stderr.txt")
GffRead-v expose (warn about) duplicate transcript IDs and
other potential problems with the given GFF/GTF records-g full path to a fasta file with the genomic sequences
for all input mappings (one per genomic sequence, with file names
matching sequence names) (#NOTE i.e., the fasta - file output from
running Trinity in genome-guided mode)-i discard transcripts having an intron larger than
1000 bp-Z merge very close exons into a single exon (when
intron size<4)-M cluster the input transcripts into loci, discarding
“redundant” transcripts (those with the same exact introns and fully
contained or equal boundaries)-K for the -M option, also discard as redundant the
shorter, fully contained transcripts (intron chains matching a part of
the container)-Q for the -M option, no longer require boundary
containment when assessing redundancy (can be combined with -K); only
introns have to match for multi-exon transcripts, and >=80% overlap
for single-exon transcripts-F keep all GFF attributes (for non-exon features)-N discard multi-exon mRNAs that have any intron with a
non-canonical splice site consensus (i.e., not GT-AG, GC-AG or
AT-AC)-P add transcript level GFF attributes about the coding
status of each transcript, including partialness or in-frame stop codons
(requires -g)--force-exons make sure that the lowest level GFF
features are considered “exon” features (#NOTE This is standard in gff3
and gtf files)-o write the output records into GffRead-v expose (warn about) duplicate transcript IDs and
other potential problems with the given GFF/GTF records-O process other non-transcript GFF records (by default
non-transcript records are ignored) (#NOTE This keeps
‘gene’ features (output by gmap) in the gff3/gtf, although “gene” is
exactly the same as “mRNA” as output by gmap)-i discard transcripts having an intron larger than
1000 bp-o write the output records into htseq-counttype “locus”As a result of calling gffread with the -Z
-M -K and -Q arguments, a new
feature is added to the gff3/gtf files: “locus”. “locus” is a putative
parent feature made from collapsing and joining “children” features
(hierarchically, “mRNA” and “exon”) as described for -Z
-M -K and -Q. This is an
extension of the kind of collapsing performed by
gffcompare -C. Thus, counts matrices were created with
respect to these “locus” features.
#!/bin/bash
sbatch \
--job-name="htseq-count-locus" \
--nodes=1 \
--cpus-per-task=8 \
--error="${err_out}-locus.%A.stderr.txt" \
--output="${err_out}-locus.%A.stdout.txt" \
htseq-count \
--order "pos" \
--stranded "${hc_strd}" \
--nonunique "all" \
--type "locus" \
--idattr "ID" \
--nprocesses 8 \
--counts_output "${out/.tsv/-locus.tsv}" \
--with-header \
${bams[*]} \
"${in}"
type “mRNA”Create the counts matrix by looking at children “mRNA” features
rather than parent “locus” features. This is the kind of collapsing
performed by gffcompare -C.
#!/bin/bash
sbatch \
--job-name="htseq-count-mRNA" \
--nodes=1 \
--cpus-per-task=8 \
--error="${err_out}-mRNA.%A.stderr.txt" \
--output="${err_out}-mRNA.%A.stdout.txt" \
htseq-count \
--order "pos" \
--stranded "${hc_strd}" \
--nonunique "all" \
--type "mRNA" \
--idattr "ID" \
--nprocesses 8 \
--counts_output "${out/.tsv/-mRNA.tsv}" \
--with-header \
${bams[*]} \
"${in}"
type “exon”Create the counts matrix by looking at children “exon” features.
#!/bin/bash
sbatch \
--job-name="htseq-count-exon" \
--nodes=1 \
--cpus-per-task=8 \
--error="${err_out}-exon.%A.stderr.txt" \
--output="${err_out}-exon.%A.stdout.txt" \
htseq-count \
--order "pos" \
--stranded "${hc_strd}" \
--nonunique "all" \
--type "exon" \
--idattr "Parent" \
--nprocesses 8 \
--counts_output "${out/.tsv/-exon.tsv}" \
--with-header \
${bams[*]} \
"${in}"
type “CDS”gffread scans the fastas supplied
-g and roughly estimates start and stop codons from the nt
sequences; it then adds these CDS estimates to the
gff3/gtf files. Create the counts matrix by
looking at these estimated “CDS” features.
#!/bin/bash
sbatch \
--job-name="htseq-count-CDS" \
--nodes=1 \
--cpus-per-task=8 \
--error="${err_out}-CDS.%A.stderr.txt" \
--output="${err_out}-CDS.%A.stdout.txt" \
htseq-count \
--order "pos" \
--stranded "${hc_strd}" \
--nonunique "all" \
--type "CDS" \
--idattr "Parent" \
--nprocesses 8 \
--counts_output "${out/.tsv/-CDS.tsv}" \
--with-header \
${bams[*]} \
"${in}"
type “gene” (“intron-only-filtering”
call to GffRead)Create the counts matrix by looking at parent “gene” features from
the files output in section “On calling gffread”
above.
#!/bin/bash
sbatch \
--job-name="htseq-count-gene" \
--nodes=1 \
--cpus-per-task=8 \
--error="${err_out}-gene.%A.stderr.txt" \
--output="${err_out}-gene.%A.stdout.txt" \
htseq-count \
--order "pos" \
--stranded "${hc_strd}" \
--nonunique "all" \
--type "gene" \
--idattr "ID" \
--nprocesses 8 \
--counts_output "${out/.tsv/-gene.tsv}" \
--with-header \
${bams[*]} \
"${in}"
gtf/gff3 files (as
dataframes)gtfs #!/bin/bash
mkdir -p /home/kalavatt/tsukiyamalab/alisong/KA.gtfs_quantile-filtered_Trinity-GG_G1-Q_2023-0410/{gtf-gff3,tsv}
# mkdir: created directory '/home/kalavatt/tsukiyamalab/alisong/KA.gtfs_quantile-filtered_Trinity-GG_G1-Q_2023-0410'
# mkdir: created directory '/home/kalavatt/tsukiyamalab/alisong/KA.gtfs_quantile-filtered_Trinity-GG_G1-Q_2023-0410/gtf-gff3'
# mkdir: created directory '/home/kalavatt/tsukiyamalab/alisong/KA.gtfs_quantile-filtered_Trinity-GG_G1-Q_2023-0410/tsv'
cd /home/kalavatt/tsukiyamalab/kalavatt/2022_transcriptome-construction/results/2023-0215
cp -r \
outfiles_gtf-gff3/Trinity-GG \
/home/kalavatt/tsukiyamalab/alisong/KA.gtfs_quantile-filtered_Trinity-GG_G1-Q_2023-0410/gtf-gff3
cp -r \
outfiles_htseq-count/Trinity-GG \
/home/kalavatt/tsukiyamalab/alisong/KA.gtfs_quantile-filtered_Trinity-GG_G1-Q_2023-0410/tsv
# Manually copying in the 'filtered/' subdirectories from local to remote
See work_assessment-processing_gtfs_part-0.md,
where steps 0 through 3 are detailed. Below, we pick up with
step 4.
#START
#!/usr/bin/env Rscript
library(GenomicRanges)
Loading required package: stats4
Loading required package: BiocGenerics
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, aperm, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce,
rownames, sapply, setdiff, sort, table, tapply, union, unique, unsplit, which.max, which.min
Loading required package: S4Vectors
Attaching package: ‘S4Vectors’
The following objects are masked from ‘package:base’:
expand.grid, I, unname
Loading required package: IRanges
Loading required package: GenomeInfoDb
library(IRanges)
library(readxl)
library(rtracklayer)
library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.2 ──✔ ggplot2 3.4.2 ✔ purrr 1.0.1
✔ tibble 3.2.1 ✔ dplyr 1.1.1
✔ tidyr 1.3.0 ✔ stringr 1.5.0
✔ readr 2.1.4 ✔ forcats 1.0.0Warning: package ‘ggplot2’ was built under R version 4.2.3Warning: package ‘tibble’ was built under R version 4.2.3Warning: package ‘dplyr’ was built under R version 4.2.3── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::collapse() masks IRanges::collapse()
✖ dplyr::combine() masks BiocGenerics::combine()
✖ dplyr::desc() masks IRanges::desc()
✖ tidyr::expand() masks S4Vectors::expand()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::first() masks S4Vectors::first()
✖ dplyr::lag() masks stats::lag()
✖ ggplot2::Position() masks BiocGenerics::Position(), base::Position()
✖ purrr::reduce() masks GenomicRanges::reduce(), IRanges::reduce()
✖ dplyr::rename() masks S4Vectors::rename()
✖ dplyr::slice() masks IRanges::slice()
options(scipen = 999)
options(ggrepel.max.overlaps = Inf)
if(stringr::str_detect(getwd(), "kalavattam")) {
p_local <- "/Users/kalavattam/Dropbox/FHCC"
} else {
p_local <- "/Users/kalavatt/projects-etc"
}
p_wd <- "2022_transcriptome-construction/results/2023-0215"
setwd(paste(p_local, p_wd, sep = "/"))
getwd()
[1] "/Users/kalavatt/projects-etc/2022_transcriptome-construction/results/2023-0215"
rm(p_local, p_wd)
gtf files (as
data.frames)gtf files (as
data.frames)
#!/usr/bin/env Rscript
# Initialize functions ---------------
import_g <- function(file) {
rtracklayer::import(file)
}
read_in_gtfs <- function(vector_files) {
list <- sapply(
vector_files,
import_g,
simplify = FALSE,
USE.NAMES = TRUE
)
names(list) <- paste0("k", c(1, 16, 2, 32, 4, 8))
df <- sapply(
list,
as.data.frame,
simplify = FALSE,
USE.NAMES = TRUE
)
return(df)
}
# G1 ---------------------------------
path_gtf_G <- "outfiles_gtf-gff3/Trinity-GG/G_N"
files_G <- list.files(
path_gtf_G,
pattern = "*.gffread.gff3",
full.names = TRUE
)
l_G_gtf <- read_in_gtfs(files_G)
files_G_introns_filtered <- list.files(
path_gtf_G,
pattern = "*.gffread-intron-filtering-only.gff3",
full.names = TRUE
)
l_G_gtf_introns_filtered <- read_in_gtfs(files_G_introns_filtered)
# Q ----------------------------------
path_gtf_Q <- "outfiles_gtf-gff3/Trinity-GG/Q_N"
files_Q <- list.files(
path_gtf_Q,
pattern = "*.gffread.gff3",
full.names = TRUE
)
l_Q_gtf <- read_in_gtfs(files_Q)
files_Q_introns_filtered <- list.files(
path_gtf_Q,
pattern = "*.gffread-intron-filtering-only.gff3",
full.names = TRUE
)
l_Q_gtf_introns_filtered <- read_in_gtfs(files_Q_introns_filtered)
# Comparisons ------------------------
# l_G_gtf[["k1"]] %>% head()
# l_G_gtf[["k2"]] %>% head()
#
# l_G_gtf[["k1"]] %>% head()
# l_Q_gtf[["k1"]] %>% head()
#
# l_Q_gtf[["k1"]] %>% head()
# l_Q_gtf[["k2"]] %>% head()
#!/usr/bin/env Rscript
# Initialize functions ---------------
read_in_mat <- function(vector_of_files) {
out_list <- sapply(
vector_of_files,
readr::read_tsv,
simplify = FALSE,
USE.NAMES = TRUE
)
names(out_list) <- paste0("k", c(1, 16, 2, 32, 4, 8))
return(out_list)
}
rename_columns <- function(list) {
lapply(
list,
function(df) {
names(df) <- gsub("bams_renamed\\/UT_prim_UMI\\/WT_", "", names(df)) %>%
gsub("_day1_ovn_", "_", .) %>%
gsub("_day7_ovn_", "_", .) %>%
gsub("_aux-F_tc-F_", "_", .) %>%
gsub("_tech1\\.UT_prim_UMI\\.bam", "", .) %>%
gsub("^\\.\\.\\.1", "id", .)
return(df)
}
)
}
# G1 ---------------------------------
path_mat_G <- "outfiles_htseq-count/Trinity-GG/G_N"
files_G_locus <- list.files(
path_mat_G, pattern = "*-locus.tsv", full.names = TRUE
)
l_G_mat_locus <- read_in_mat(files_G_locus) %>%
suppressMessages()
files_G_mRNA <- list.files(
path_mat_G, pattern = "*-mRNA.tsv", full.names = TRUE
)
l_G_mat_mRNA <- read_in_mat(files_G_mRNA) %>%
suppressMessages()
files_G_exon <- list.files(
path_mat_G, pattern = "*-exon.tsv", full.names = TRUE
)
l_G_mat_exon <- read_in_mat(files_G_exon) %>%
suppressMessages()
files_G_CDS <- list.files(
path_mat_G, pattern = "*-CDS.tsv", full.names = TRUE
)
l_G_mat_CDS <- read_in_mat(files_G_CDS) %>%
suppressMessages()
files_G_introns_filtered <- list.files(
path_mat_G,
pattern = "*.gffread-intron-filtering-only.hc-strd-eq-gene.tsv",
full.names = TRUE
)
l_G_mat_introns_filtered <- read_in_mat(files_G_introns_filtered) %>%
suppressMessages()
# Q ----------------------------------
path_mat_Q <- "outfiles_htseq-count/Trinity-GG/Q_N"
files_Q_locus <- list.files(
path_mat_Q, pattern = "*-locus.tsv", full.names = TRUE
)
l_Q_mat_locus <- read_in_mat(files_Q_locus) %>%
suppressMessages()
files_Q_mRNA <- list.files(
path_mat_Q, pattern = "*-mRNA.tsv", full.names = TRUE
)
l_Q_mat_mRNA <- read_in_mat(files_Q_mRNA) %>%
suppressMessages()
files_Q_exon <- list.files(
path_mat_Q, pattern = "*-exon.tsv", full.names = TRUE
)
l_Q_mat_exon <- read_in_mat(files_Q_exon) %>%
suppressMessages()
files_Q_CDS <- list.files(
path_mat_Q, pattern = "*-CDS.tsv", full.names = TRUE
)
l_Q_mat_CDS <- read_in_mat(files_Q_CDS) %>%
suppressMessages()
files_Q_introns_filtered <- list.files(
path_mat_Q,
pattern = "*.gffread-intron-filtering-only.hc-strd-eq-gene.tsv",
full.names = TRUE
)
l_Q_mat_introns_filtered <- read_in_mat(files_Q_introns_filtered) %>%
suppressMessages()
# Comparisons ------------------------
# l_Q_mat_sam[["k1"]] %>% head()
# l_Q_mat_sam[["k2"]] %>% head()
#
# l_Q_mat_sam[["k1"]] %>% head()
# l_Q_mat_rev[["k1"]] %>% head()
# Rename columns ---------------------
#+ ...in each dataframe composing each list
l_G_mat_locus <- rename_columns(l_G_mat_locus)
l_G_mat_mRNA <- rename_columns(l_G_mat_mRNA)
l_G_mat_exon <- rename_columns(l_G_mat_exon)
l_G_mat_CDS <- rename_columns(l_G_mat_CDS)
l_G_mat_introns_filtered <- rename_columns(l_G_mat_introns_filtered)
l_Q_mat_locus <- rename_columns(l_Q_mat_locus)
l_Q_mat_mRNA <- rename_columns(l_Q_mat_mRNA)
l_Q_mat_exon <- rename_columns(l_Q_mat_exon)
l_Q_mat_CDS <- rename_columns(l_Q_mat_CDS)
l_Q_mat_introns_filtered <- rename_columns(l_Q_mat_introns_filtered)
#!/usr/bin/env Rscript
# Initialize function ----------------
filter_transfrags <- function(x, y) {
df <- x
if(y == "G1" | y == "G") {
df <- x[, 1:3] %>% head(., -5)
} else if(y == "Q") {
df <- x[, c(1, 6, 7)] %>% head(., -5)
} else if(y != "G1" | y != "G" | y != "Q") {
stop("Parameter y must be either 'G1', 'G', or 'Q'")
}
df[, 4] <- df[, 2] + df[, 3]
colnames(df)[4] <- "sum"
ids <- list()
ids$percentiles <- quantile(
df$sum,
probs = c(
0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50,
0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95
)
) %>%
round()
df$below_05 <- ifelse(df$sum <= ids$percentiles[1], TRUE, FALSE)
df$below_10 <- ifelse(df$sum <= ids$percentiles[2], TRUE, FALSE)
df$below_15 <- ifelse(df$sum <= ids$percentiles[3], TRUE, FALSE)
df$below_20 <- ifelse(df$sum <= ids$percentiles[4], TRUE, FALSE)
df$below_25 <- ifelse(df$sum <= ids$percentiles[5], TRUE, FALSE)
df$below_30 <- ifelse(df$sum <= ids$percentiles[6], TRUE, FALSE)
df$below_35 <- ifelse(df$sum <= ids$percentiles[7], TRUE, FALSE)
df$below_40 <- ifelse(df$sum <= ids$percentiles[8], TRUE, FALSE)
df$below_45 <- ifelse(df$sum <= ids$percentiles[9], TRUE, FALSE)
df$below_50 <- ifelse(df$sum <= ids$percentiles[10], TRUE, FALSE)
df$below_55 <- ifelse(df$sum <= ids$percentiles[11], TRUE, FALSE)
df$below_60 <- ifelse(df$sum <= ids$percentiles[12], TRUE, FALSE)
df$below_65 <- ifelse(df$sum <= ids$percentiles[13], TRUE, FALSE)
df$below_70 <- ifelse(df$sum <= ids$percentiles[14], TRUE, FALSE)
df$below_75 <- ifelse(df$sum <= ids$percentiles[15], TRUE, FALSE)
df$below_80 <- ifelse(df$sum <= ids$percentiles[16], TRUE, FALSE)
df$below_85 <- ifelse(df$sum <= ids$percentiles[17], TRUE, FALSE)
df$below_90 <- ifelse(df$sum <= ids$percentiles[18], TRUE, FALSE)
df$below_95 <- ifelse(df$sum <= ids$percentiles[19], TRUE, FALSE)
ids$below_05 <- df[df$below_05 == FALSE, ]$id
ids$below_10 <- df[df$below_10 == FALSE, ]$id
ids$below_15 <- df[df$below_15 == FALSE, ]$id
ids$below_20 <- df[df$below_20 == FALSE, ]$id
ids$below_25 <- df[df$below_25 == FALSE, ]$id
ids$below_30 <- df[df$below_30 == FALSE, ]$id
ids$below_35 <- df[df$below_35 == FALSE, ]$id
ids$below_40 <- df[df$below_40 == FALSE, ]$id
ids$below_45 <- df[df$below_45 == FALSE, ]$id
ids$below_50 <- df[df$below_50 == FALSE, ]$id
ids$below_55 <- df[df$below_55 == FALSE, ]$id
ids$below_60 <- df[df$below_60 == FALSE, ]$id
ids$below_65 <- df[df$below_65 == FALSE, ]$id
ids$below_70 <- df[df$below_70 == FALSE, ]$id
ids$below_75 <- df[df$below_75 == FALSE, ]$id
ids$below_80 <- df[df$below_80 == FALSE, ]$id
ids$below_85 <- df[df$below_85 == FALSE, ]$id
ids$below_90 <- df[df$below_90 == FALSE, ]$id
ids$below_95 <- df[df$below_95 == FALSE, ]$id
return(ids)
}
# G1 ---------------------------------
f_G_mat_locus <- sapply(
l_G_mat_locus,
filter_transfrags,
simplify = FALSE,
USE.NAMES = TRUE,
y = "G1"
)
f_G_mat_mRNA <- sapply(
l_G_mat_mRNA,
filter_transfrags,
simplify = FALSE,
USE.NAMES = TRUE,
y = "G1"
)
f_G_mat_exon <- sapply(
l_G_mat_exon,
filter_transfrags,
simplify = FALSE,
USE.NAMES = TRUE,
y = "G1"
)
f_G_mat_CDS <- sapply(
l_G_mat_CDS,
filter_transfrags,
simplify = FALSE,
USE.NAMES = TRUE,
y = "G1"
)
f_G_mat_introns_filtered <- sapply(
l_G_mat_introns_filtered,
filter_transfrags,
simplify = FALSE,
USE.NAMES = TRUE,
y = "G1"
)
# Q ----------------------------------
f_Q_mat_locus <- sapply(
l_Q_mat_locus,
filter_transfrags,
simplify = FALSE,
USE.NAMES = TRUE,
y = "Q"
)
f_Q_mat_mRNA <- sapply(
l_Q_mat_mRNA,
filter_transfrags,
simplify = FALSE,
USE.NAMES = TRUE,
y = "Q"
)
f_Q_mat_exon <- sapply(
l_Q_mat_exon,
filter_transfrags,
simplify = FALSE,
USE.NAMES = TRUE,
y = "Q"
)
f_Q_mat_CDS <- sapply(
l_Q_mat_CDS,
filter_transfrags,
simplify = FALSE,
USE.NAMES = TRUE,
y = "Q"
)
f_Q_mat_introns_filtered <- sapply(
l_Q_mat_introns_filtered,
filter_transfrags,
simplify = FALSE,
USE.NAMES = TRUE,
y = "Q"
)
gtf data.frame to retain
percentile-filtered gene_idsgtf data.frame to retain
percentile-filtered gene_ids
out$k4
$below_05
$below_10
$below_15
$below_20
$below_25
$below_30
$below_35
$below_40
$below_45
$below_50
$below_55
$below_60
$below_65
$below_70
$below_75
$below_80
$below_85
$below_90
$below_95
NA
gtfsgtfs
path_G_mRNA
[1] "outfiles_gtf-gff3/Trinity-GG/G_N/filtered/mRNA"